Levy scaling: the Diffusion Entropy Analysis applied to DNA sequences 
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We address the problem of the statistical analysis of a time series generated by complex dynamics 
with a new method: the Diffusion Entropy Analysis (DEA) (Fractals, 9, 193 (2001)). This method is 
based on the evaluation of the Shannon entropy of the diffusion process generated by the time series 
imagined as a physical source of fluctuations, rather than on the measurement of the variance of this 
diffusion process, as done with the traditional methods. We compare the DEA to the traditional 
methods of scaling detection and we prove that the DEA is the only method that always yields the 
correct scaling value, if the scaling condition applies. Furthermore, DEA detects the real scaling of 
a time series without requiring any form of de-trending. We show that the joint use of DEA and 
variance method allows to assess whether a time series is characterized by Levy or Gauss statistics. 
We apply the DEA to the study of DNA sequences, and we prove that their large-time scales are 
characterized by Levy statistics, regardless of whether they are coding or non-coding sequences. We 
show that the DEA is a reliable technique and, at the same time, we use it to confirm the validity 
of the dynamic approach to the DNA sequences, proposed in earlier work. 
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I. INTRODUCTION 

The recent progress in experimental techniques of 
molecular genetic shas made available a wealth of genome 
data (see, for example, Ref. 0), and raised the interest 
for the statistical analysis of DNA sequences. The pi- 
oneer papers mainly focused on the controversial issue 
of whether long-range correlations are a property shared 
by both coding and non-coding sequences or are only 
present in non-coding sequences The results of 

more recent papers yield the convincing conclusion 
that the former condition applies. However, some sta- 
tistical aspects of the DNA sequences are still obscure, 
and it is not yet known to what an extent the dynamic 
approach to DNA sequences proposed by the authors of 
Ref. II is a reliable picture for both coding and non- 
coding sequences. The later work of Refs. Q and 
established a close connection between long-range cor- 
relations and the emergence of non-Gaussian statistics, 
confirmed by Mohanty and Narayana Rao Q . According 
to the dynamic approach of Refs. |8||llJ this non-Gaussian 
statistics should be Levy, but this property has not yet 
been assessed with compelling evidence. The reason for 
the confusion affecting this issue is deeper than one can 
imagine, since it essentially depends on the fact the there 
exists no reliable method of scaling detection. In fact, all 
the traditional methods of scaling detection on the mar- 
ket, the Detrended Fluctuation Analysis (DFA) Q, the 
Standard Deviation Analysis (SDA) , and the Wavelets 
Spectral Analysis (WSA) J7|JT^], are based on the evalu- 
ation of the variance of the process, and therefore yield 



a scaling that is the correct one only if the process under 
study is Gaussian. 

The main purposes of this paper are: 

1) To clarify the meaning of scaling as a form of ther- 
modynamic equilibrium that can be reached after a long 
time transient, throughout which the conventional tech- 
niques of analysis can yield misleading information. 

2) To show that a new method, the Diffusion Entropy 
Analysis (DEA), recently proposed in Ref. (m), is able 
to yield the correct scaling, even when the observed dif- 
fusion process is not Gaussian. We shall show that the 
departure of the correct scaling, detected by means of the 
DEA, from the results of the traditional methods, all of 
them being variance-based methods, is a clear indication 
of the non-Gaussian character of the process under study. 

3) To show the DEA in action by means of an appli- 
cation to the study of DNA sequences. As a remarkable 
result, we shall show that both coding and non-coding 
DNA sequences depart from Gaussian statistics and pro- 
duce Levy diffusion. This will shed light on some still 
obscure aspects of the statistical properties of DNA. 



II. THE MEANING OF SCALING 

The reason for the confusion still present in the issue of 
the extraction of the long-range statistical properties of 
DNA sequences (and more in general of any time series: 
heartbeats, earthquakes, oscillations of markets stocks 
etc.) is essentially due to the fact the there are no reli- 
able methods of scaling detection. To clarify this crucial 
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aspect we need to discuss, first, what scaling is all about. 
Scaling is a property of a probability distribution p{x, t), 
which formally reads as: 

When we deal with a time series or a generic sequence we 
need first to construct the probability distribution p{x, f) . 
In order to do so we convert with some method, for in- 
stance the one used in this paper, the single sequence 
into many distinct trajectories. These trajectories start 
at time t = from x — 0, and then spread over the 
x-axis, as a result of their, partial or total, random na- 
ture. Thus, rather than observing a single trajectory, we 
are naturally led to evaluate the probability of observing 
it. In other words, we rest, with theoretical or computa- 
tional arguments, on the probability of finding the vari- 
able x in the interval [x, x + dx] at time t, denoted by us 
as P(x,dx,t). The probability density, p(x,t), is defined 
by P(x,dx,t)/dx . The meaning of Eq.(Q) is that the 
process is stationary, in spite of the fact that the prob- 
ability density p(x,t) broadens with time. To stress this 
aspect, let us focus our attention on the probability den- 
sities p(x, t\) and p(x, tz), at two distinct times t\ and t%, 
with t\ <t%. Let us squeeze the abscissa scale of the later 
distribution, p(x, is), by the factor R = (ti/ia) 15 < 1, and 
then enhance the intensity of the resulting distribution 
density by multipliying it by the factor 1/R > 1. If the 
property of Eq.(|l|) holds true, then the resulting distri- 
bution density is identical to the former, p(x, t\). This 
is equivalent to interpreting the distribution density as a 
form of equilibrium distribution. This property is deeply 
related to the foundation itself of statistical mechanics 
p5| . In fact, in the case where the diffusion trajectory 
is the superposition of many uncorrelated fluctuations, 
the resulting diffusion process is predicted by the Cen- 
tral Limit Theorem (CLT) to be a Gaussian probability 
distribution, a special form of canonical equilibrium, and 
we can refer ourselves to the transient process necessary 
for the CLT to work as a kind of transition from dynamics 
to thermodynamics. In this sense the scaling property of 
Eq.(^) must be interpreted as a form of thermodynamic 
equilibrium. Note that in the case of ordinary statistical 
mechanics, when the CLT applies, we have that 8 = 1/2 
and F(y) is a Gaussian function of y. 

According to the new field of Science of Complexity 
[^6|,0, a complex process is expected to yield the prop- 
erty of Eq.(0) with d ^ 1/2 and (or) F(y) being a form 
different from the Gaussian one (we shall discuss an ex- 
ample of this non-Gaussian form in later sections). Thus, 
this raises the question of whether a non canonical equi- 
librium condition can be generated by sequences reflect- 
ing complex dynamics. We should consider three differ- 
ent possibilites: 

1) Mandelbrot Jl7| proposes Fractional Brownian Mo- 
tion (FBM) as a condition exceeding the limits of ordi- 



nary statistical mechanics. This corresponds to the scal- 
ing condition of Eq.(|l]) with 5^1/2 while F(y) keeps its 
Gaussian form. 

2) Another possible form of violation, naturally stem- 
ming from the Generalized Central Limit Theorem 
(GCLT) @, rests on Eq.(|) with 6 > 1/2 and F(y) 
being a Levy function, with the asymptotic property 
lim y - f00 F{y) — const j 'y 1+1/ ' & . This means the occur- 
rence of a disconcerting condition, where the second mo- 
ment of the distribution is infinite. It is obvious that 
in practice real time series cannot produce this condi- 
tion, and that the distribution moments of the observed 
diffusion process are always finite, being an imperfect re- 
alization of the diffusion process with infinite moments. 

3) Finally, we should consider also the stretched Gaus- 
sians emanating from subdiffusion [^9|. Actually, this 
kind of process is not explicitly examined in this paper. 
We expect that in this case the standard techniques of 
scaling dectection might do better than in case 2), since 
the stretched Gaussians are characterized by finite mo- 
ments. Therefore, we shall focus our attention on both 
case 1), where the standard techniques are expected to 
yield exact results, and on case 2), where the standard 
techniques are expected to fail. 

As we shall show in this paper, all techniques cur- 
rently adopted to detect scaling are explicitly or implic- 
itly based on the measurement of the second moment of 
the distribution p(x,t). Thus, the scaling revealed by 
the ordinary techniques of analysis might depart from 
the genuine scaling of the process under observation, if 
this is an imperfect realization of a diffusion process with 
infinite moments. To stress this crucial aspect we adopt 
for the scaling parameter 5 the symbol H, according to 
a notation proposed by Mandelbrot to honor Hurst pC| ] 
(see also Ref. |2lJ ) . Notice that a widely adopted method 
to express the condition of Eq. ([l]) is given by 

xott 5 . (2) 

This way of expressing the scaling condition is the source 
of misleading procedures. In fact it is usually assumed 
that it is equivalent to 

/+oo 
x 2 p(x,t)dx cx t s . (3) 
-00 

We think that it is much more appropriate to use the 
following notation 

< x 2 (t) > 1 /2 0C t H , (4) 

leaving open the possibility that H ^= 5. 

In this paper we show that the Diffusion Entropy Anal- 
ysis (DEA) [ pif is the only technique yielding the cor- 
rect scaling S when the observed diffusion process de- 
parts from the FBM condition. In fact all the other 
techniques, including the Detrended Fluctuation Analy- 
sis (DFA) Q, the Standard Deviation Analysis (SDA) 



2 



§, and the Wavelets Spectral Analysis (WSA) (7|,|13|, 
yield a scaling that would be correct only in the FBM 
case. This is so because, as we shall see, these techniques 
rest on variance to evaluate scaling. All these techniques, 
whose limitations are bypassed by the DEA, are in a sense 
different versions of the same method, to which we shall 
refer to as the Variance Method ( VM) . The departure of 
the correct scaling, revealed by the DEA, from the results 
of the VM is consequently a proof of the non-Gaussian 
character of the process under study. 



III. THE DIFFUSION ENTROPY ANALYSIS 
(DEA) 

The Diffusion Entropy Analysis (DEA) is based upon 
the direct evaluation of the Shannon entropy of the dif- 
fusion process. In the continuous-space and continuous- 
time representation for the probability density p(x, t), the 
Shannon entropy p2] of the diffusion process reads 



S(t) 



dxp(x, t) \n[p{x, t)]. 



(5) 



To show how the DEA works, let us assume that p(x, t) 
fits the scaling condition of Eq.(|l|). Let us plug Eq.(|l]) 
into Eq.(||). After a simple algebra, we get: 



where 



and 



S(r) = A + St, 



/OO 
dyF(y) \n[F(y)} 
-OO 



T = ln(i). 



(G) 



(7) 



(8) 



Eq. (|6|) shows that if the diffusion process scales with the 
parameter 8, the resulting diffusion entropy becomes a 
linear function of the logarithm of t, with a slope equal 
to 6. This makes the slope measurement equivalent to the 
scaling detection, independently of the form of F(y) . 

In the case of ordinary Brownian diffusion, 5 = 1/2 
and F(y) has the following Gaussian form 



F(y) 

Thus Eq.(||) becomes 



exp 



(-£) 



V2~t 



SW = ^[l + m(2™ 2 )]+i m (t). 



(9) 



(10) 



In this case, we have assumed the system to be already 
in the scaling regime state. More in general, we shall 
have to address the problem of the transition from the 
dynamic to the thermodynamic (scaling) regime. 



IV. LEVY WALK 

The artificial sequences that we shall use in this pa- 
per to show the merits of DEA and the limits of VM 
rests on a dynamic approach adopted years ago to derive 
Levy statistics |ll],^3| • The importance of this approach 
to Levy statistics is due the fact that it makes possible, 
in principle, to use the same perspective as that adopted 
in Ref. Q. Bianucci et al. Q discussed the case of a 
system of interest interacting with another system called 
booster rather than thermal bath, to emphasize that no 
assumption on its thermodynamic nature was done. The 
basic aspect of the research project of Ref. was that 
statistical mechanics, in that case ordinary statistical me- 
chanics, had to be derived from merely dynamic rather 
than thermodynamic arguments. The same approach can 
be applied to the derivation of Levy statistics, with only 
one significant difference: the phase space of the booster 
rather than being fully chaotic, as in the case of ordi- 
nary statistical mechanics, is weakly chaotic p5[. The 
phase space consists of chaotic and regular regions, and 
the booster trajectory tends to sojourn for a long time 
at the border between chaotic and ordered regions. The 
waiting time distribution is an inverse power law, and, 
for simplicity, we assume it to be given by 



^(i) = 0u-l) 



We make the assumption 



rpfl- 



(t + ty ' 



M > 2, 



(11) 



(12) 



which ensures the mean waiting time tm to get the finite 
value 



T 



(M-2) 



(13) 



It is evident from this formula that the parameter T, as 
well as the power index fi, determine the time duration 
of the sojourn of the trajectory at the border between 
chaotic and ordered regions. This inverse power law form, 
and the resulting stickiness, are naturally generated by 
the self-similar nature of the borders [^5| . We call these 
crucial subsets of the phase space fractal borders. 

Now, let us assume that one of the variables of the 
phase space, called £, is the generator of the fluctuations 
that are collected by the diffusing variable x. Since the 
fractal borders have a finite size, when the trajectory 
sticks to one fractal border, the variable £ gets a value 
that depends on the trajectory position. Let us make 
also the assumption that there are only two fractal bor- 
ders, and that their size compared to that of the whole 
phase space is so small that the variable £ gets only two 
distinct values, denoted by us as W and —W. As an 
example of Hamiltonian model generating velocity fluc- 
tuations we have in mind the kicked rotor in the so called 
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accelerating state [|26|-|28|. The booster trajectory moves 
erratically in the chaotic sea between the two fractal re- 
gions, and after a given time sticks to one of the two frac- 
tal regions. After an extended time spent in this fractal 
region it goes back to the chaotic sea, and after a short 
diffusion process, it either goes back to the earlier frac- 
tal region or it goes to the other one. Due to the power 
law nature of the waiting time distribution of Eq. (11), 



the sojourn in the chaotic sea can be ignored. As a re- 
sult of this dynamic process we shall get a sequence such 

as W,W,W,W,...- W, -W, -W, -W, W,W,W, In 

this paper we set W = 1. This is an example of the 
time series under discussion in this paper. For simplic- 
ity, rather than deriving it running a dynamic system, 
as the kicked rotor in the accelerating state |2q-|2q|, we 
can directly generate the random sequence {Ti,^i} in the 
following way: first the numbers Tj are randomly drawn 
from the the distribution of Eq. ([ll]) ; then the value of £j 
is established by tossing a coin, and it is assumed that 
the variable £ gets the specific value £j for the whole time 
interval Ti. 

To understand the connection between this kind of se- 
quence and Levy statistics, we have to use the fluctuation 
£ to generate diffusion by means of the following equation 
of motion: 



x (t) = m 



(14) 



As remarked earlier, £ is a dichotomous variable, i.e. £ = 
±1, where 1 is a unit of length. The solution of ( [[ij ) is 
given by 



x(t) = x(0) + / dt' 



(15) 



and our final goal is to evaluate < x 2 (t) >. 

As pointed out by Zaslavsky p5|] , the condition \i > 2, 
assumed throghout this paper (see Eq.([l2|)), ensures the 
stationary condition, which allows us to properly define 
$j(i), the normalized correlation function of the fluctu- 
ation £. This important dynamic property, according to 
the renewal theory (29|, is related to ip{t) by 



**(*) = — r(t'-tm>)dt\ 

TM Jt 



(16) 



where tm denotes the mean waiting time. Using for ip(t) 
the expression of Eq. dill) we obtain 



T 



t + T 



p.-2 



(17) 



In this case tm is given by Eq.(|13|). Squaring the expres- 
sion for x(t) given byEq.@ and by using the stationary 
and dichotomous nature of the fluctuation £(t) = ±1 (the 
latter yielding < £ 2 >= 1), it is easy to prove that the 
mean square displacement < x 2 (t) > is given by 



dt 



— < x 2 (t) >= 2 / dt' <S>z(t-t') . 



Finally, by using Eq.(17) we get: 



lim < x (t) >cx t 

t — >oo 



2H 



with 



and 



H = 



H = - 
2 



if M < 3, 



if M > 3 



(18) 



(19) 



(20) 



(21) 



It is therefore evident that fj, = 3 is the border between 
ordinary and anomalous diffusion. As pointed out in Sec- 
tion [n| this result can be trusted only in the Gaussian 
case. 

Let us see why this way of evaluating scaling needs 
some caution. Thank to the condition of Eq.(|l2|), we 
can define the number N = [</tm], where [y] denote 
the integer part of y. In the case t » tm the number 
N becomes virtually identical to the number of random 
drawings of the numbers 7j and This is equivalent to 
drawing the N numbers rji = ^Tj. 

1) In the case where the condition /i > 3 applies, this 
distribution has a finite second moment. Thus, we can 
use the Central Limit Theorem (CLT), which yields a 
Gaussian diffusion, and consequently, H = 1/2, which 
correctly reflects the scaling in this case. 

2) In the case 2 < fi < 3, the second moment of this 
distribution is divergent, thereby preventing us from us- 
ing the CLT. However, in this case we use the General- 
ized Central Limit Theorem (GCLT) |18|. As shown in 
Ref. |3(]] , this random extraction of numbers yields a dif- 
fusion process, described by the probability distribution 
Pi,(x,t), whose Fourier transform, pu(k,t), reads 



with 



p L (k,t) = exp{b\k\»-H) (22) 



b = W{TWy- 2 sin[iv(n - 2)/2]r(3 - p). (23) 



The subscript L stands for Levy. The numerical simula- 
tions support this theoretical expectation |30| . Note that 
this dynamic approach to Levy statistics coincides with 
the Levy walk (29). The difference between Levy walk 
and Levy flight is well known. In the case of Levy flight 
the random walker makes instantaneously jumps of arbi- 
trary intensity. In the case of Levy walk, instead, it takes 
the random walker a time proportional to \r)i\ to make a 
jump with this intensity. In the case of Levy flight, the 
random walker makes jumps of intensity |?7i| at regular 
time intervals. 
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We note that the scaling of Eq.([j]) derives naturally 
from the joint use of the assumption x oc t s and norm 
conservation. It is straightforward to show that within 
the Fourier representation the norm conservation yields 
Pi(0,t) = 1. On the other hand, moving from to 
|k| = l/cli 1 /^ -1 ) we obtain the time independent Fourier 
transform pu{k) — exp{— which fits the normal- 
ization condition, and yields the scaling 



5 = 



V - 



1 ' 



(24) 



which has to be compared to Eq. (|2C|) . It is evident that 
H 5, in this case. 

In this paper, we shall focus our attention on the dy- 
namic condition fitting both the condition of Eq.(pl 
(i > 2, and the condition 



/! < 3. 



(25) 



This is in line with the arguments of the dynamic ap- 
proach to DNA of the earlier work of Refs. fs|-|ll]|, which 
proved the DNA sequences to be equivalent to a dynamic 
process fitting both conditions, ensuring stationarity, the 
former, and superdiffusion, the latter, at the same time. 

There are two important issues to clarify before pro- 
ceeding with the next sections. The reader can find a 
detailed account somewhere else Jll],||l],[32|] . However, to 
make this paper as much selfcontained as possible, we 
shall shortly outline both of them. The first issue has to 
do with the time required for the GCLT to apply. The 
work of Ref. ]3l| shows that the predictions of the GCLT 
are realized by the following expression for p{x, t) : 



p(x, t) = K(t)p T (x, t)9(Wt - \x\) 



Wt)I p {t). 
(26) 



where pT{x,t) is a distribution that for t — > oo becomes 
identical to the Levy probability distribution of the vari- 
able x, namely a function whose Fourier transform coin- 
cides with Eq. ( p2|) , denotes the Heaviside step function 
and K(t) is a time-dependent factor ensuring the nor- 
malization of the distribution p{x,t). This contribution 
to Eq.(|26|) is a truncated Levy distribution, the ratio- 
nale for it being that no trajectory can travel with veloc- 
ity of intensity larger than W . The trajectories that at 
time t > are still travelling in the same direction as at 
time t — produce two peaks located at the propagation 
fronts, x = Wt and x — — Wt, and their contribution to 
p(x, t) is given by the second term on the right hand side 
of Eq. (p6|) . The number of trajectories that contribute 
to the peaks is given by the function I p that has been 
evaluated in detail by the authors of Ref. |3l| . Here it is 
enough to say that these authors find 



lim {l p (t)-<i> i (t)} = 0. 



(27) 



This means that in the time asymptotic limit the peak 
intensity becomes identical to the correlation function 
$^(t) of Eq.(|l7j). On the basis of these arguments they 
reach the conclusion that in the asymptotic time limit 
Eq.(p6|) becomes identical to 

p(x, t) = p L (x, t)6(Wt - \x\) + ±8(\x\ - Wt)*t(t), (28) 

which coincides with the earlier prediction of Ref. JTl| . 
This conclusion seems to be compatible with the results 
obtained by using the theory of Continuous Time Ran- 
dom Walk (CTRW) Q, although these authors do no 
refer explicitly to the correlation function $f (t). For an 
earlier work based on the CTRW see Ref. |54| 

To provide an answer to the first question it is enough 
to rest on the earlier result of Eq. ( |28| ) . It takes an infi- 
nite time for the GCLT to apply: in fact the intensity of 
the peaks of the propagation front is proportional to the 
correlation function of Eq.(p7j), which is not integrable. 
During this long transient, as we shall see, the DEA gets 
closer and closer to the true scaling of Eq. ( pi| ) , while the 
distribution second moment, which is finite due to the 
truncation of the Levy distribution, yields the fake scal- 
ing of Eq.©. 

The second issue is less relevant to the main purpose 
of this paper. It has to do with another approach to 
the true scaling, already discussed in Ref. . This has 
to do with the Hamiltonian derivation of Levy statistics 
mentioned in Section 0. We study the time evolution of 
the probability distribution of the diffusion variable x, of 
the fluctuating variable £ and of all other variables that 
might be responsible for the fluctuations of £. Then, we 
make a trace over all the "irrelevant" variables, namely, 
all the variables but x. The resulting equation of mo- 
tion is not Markovian, and no ordinary method to make 
the Markovian approximation can be applied. This is so 
because the projection method yields a time convoluted 
diffusion equation with a memory term given by the cor- 
relation function $^(i) of Eq.(17), which is not integrable. 
Consequently a new way to make the Markovian approx- 
imation also in this case was invented Jll| l . It was noticed 
that this approximation changes the time convoluted dif- 
fusion equation into a master equation | j35| . To derive 
from it a result consistent with that of the CTRW used 
in an earlier work of Zumofen and Klafter |p6|, a nd with 
Levy statistics as well, the authors of Ref. [{35| had to 
use as a bridge the master equation method of Ref. |37| . 
This master equation gets the form of a fractional deriva- 
tive, and, the resulting diffusion process coincides with 
the predictions of the GCLT, with a diffusion strength b 
that coincides with that of Eq.(p3|). It comes to be a sur- 
prise, therefore, that the recent work of Ref. (3^] proves 
that the exact solution of the time convoluted diffusion 
equation yields the same scaling as the VM, namely, the 
scaling of Eq. (|2^) . This suggests that densities and tra- 
jectories might not speak the same language in the case 
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of non-ordinary statistical mechanics, and it makes much 
stronger than ever the need for detecting the correct scal- 
ing of a time series. 



V. THE ALGORITHM 



Let us consider a sequence of M numbers 



6, 



,M. 



(29) 



The purpose of the DEA is to establish the possible ex- 
istence of a scaling, either normal or anomalous, in the 
most efficient way as possible without altering the data 
with any form of detrending. Here we describe the algo- 
rithm adopted in this paper. 

Let us select first of all an integer number I, fitting 
the condition 1 < I < M . This integer number will be 
referred to by us as "time" . For any given time / we can 
find M — I + 1 sub-sequences of length I defined by 



2 = 1,. 



(30) 



with s = 0, . . . , M— I. For any of these sub-sequences we 
build up a diffusion trajectory, s, defined by the position 



,(») 



(0 



£d s) 

i=l 



I 

= £6 

i=l 



+ .s- 



(31) 



Let us imagine this position as that of a Brownian par- 
ticle that at regular intervals of time has been jumping 
forward of backward according to the prescription of the 
corresponding sub-sequence of Eq. ( J30| ) . This means that 
the particle before reaching the position that it holds at 
time I has been making I jumps. The jump made at the 
i-th step has the intensity and is forward or back- 

(s) 

ward according to whether the number Q is positive or 
negative. 

We are now ready to evaluate the entropy of this dif- 
fusion process. In order to do so we have to partition 
the a;-axis into cells of size e(l) and to count how many 
particles are found in the cell i at a given time I. We 
denote this number by iVj(Z). Then we use this number 
to determine the probability that a particle can be found 
in the i-th cell at time I, Pi(l), by means of 



MO 



Ni(l) 



(M-l + 1) 

The entropy of the diffusion process at time I is: 



s<*(0 = -I>(0MPi(0]- 



(32) 



(33) 



from the continuous-time and continuous-space picture 
of Eq. (||) The easiest way to proceed with the choice of 
the cell size, e(l), is to assume it independent of I and 
determined by a suitable fraction of the square root of 
the variance of the fluctuation . 

In this paper we study sequences of numbers & = +1 
or — 1. Because at any step, the jump has the intensity 
equal to 1, the most reasonable choice of the cell size is 
given by e(l) — 1. In this way any cell corresponds to a 
unique position x(l) of the diffusion trajectory defined in 
( p0| ) and (pTj) . Moreover, e(l) — 1 is the square root of 
the variance of the random dichotomous fluctuation of 
intensity equal to 1. 

Few remarks about the meaning of the integer number 
I are necessary for the reader to understand the content 
of the next sections. As said before, I is the length of 
a window moving all over the available sequence to de- 
fine distinct trajectories. These trajectories are used to 
produce diffusion, and consequently we shall often re- 
fer to I as time. This should not confuse the reader. 
The adoption of the term time is suggested by the for- 
mal equivalence with the processes of either normal or 
anomalous diffusion, where walker's jumps occur in time. 
Here, these jumps occur as we move from a sequence site 
to the next, and consequently time here has to do with 
the site positions. Furthermore, we shall be often using 
for this kind of time the symbol t rather than I. This has 
to do with the fact that for windows of very large size 
the integer number I becomes virtually indistinguishable 
from a continuous number. To emphasize this aspect we 
shall adopt the symbol t rather than I. 



VI. TRANSITION REGIME: RANDOM WALK 
AND LEVY WALK 

In Section |n| we have shown that scaling is equiva- 
lent to thermodynamic equilibrium with the equilibrium 
distribution F{y). We refer to the transient process nec- 
essary to realize this form of thermodynamic equilibrium 
from the initial condition with all the trajectories located 
at i = 0, as transition from microscopic dynamics to 
thermodynamics. Here we illustrate this transition in 
two different cases, ordinary Brownian motion and Levy 
walk. In the former case the transition from microscopic 
dynamics to thermodynamics can be interpreted as a 
transition from the discrete to the continuous time rep- 
resentation. In the second case the transition is more 
extended and can be still perceived after reaching the 
continuous time regime. 



Note that the subscript d stands for discrete and serves 
the purpose of reminding the reader that the numerical 
evaluation of the diffusion entropy departs by necessity 
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A. The transition regime in the case of Brownian 
walk 



The discrete perspective can be illustrated by using 
the random walk theory that is expected to apply when 
our dichotomous signal is completely random. In this 
specific case, with no correlation, the probability p m (l), 
for the random walker to be at position m after I jumps 
of intensity 1 in either positive or negative direction, is 
determined by the binomial expression fl38|: 



Pm(0 = 



1 



I \ 1 + (-l) l + m 



2l y t+m j 2 

and the diffusion entropy reads 



S d (l) 



E 



p m {l) ln\p m (l)}. 



(34) 



(35) 



=-/ 



In the continuous time limit we expect Eq. ( J10| ) to apply. 
Fig.l shows that, after a short initial regime, the dis- 
crete diffusion entropy converges to the continuous time 
prescription (solid line in figure) . In the case of Brown- 
ian walk we can interpret the transition from microscopic 
dynamics to thermodynamics as the transition from the 
binomial formula of Eq.(|34|) to the Gaussian expression 
of Eq.(§), with a = 0.5. 



B. The transition regime in the case of Levy walk 

Here we show how to build a sequence corresponding 
to the prescription of Section IV . In an earlier work |39|] 



the reader can find the illustration of an algorithm that, 
using a generator of random numbers of the interval [0, 1] , 
creates the waiting time distribution of Eq. ( |ll| ) . Here we 
illustrate a slightly different method, generating a distri- 
bution of integer times that is exactly, rather than ap- 
proximately, equivalent to a shifted inverse power law. 
This serves the purpose of making as fast as possible the 
transition from microscopic dynamics to thermodynam- 
ics, without further delay caused by the time it takes the 
distribution to become the shifted inverse power law of 
Eq.©- 

To realize this purpose, first of all we need to generate 
a series of i integer numbers L(i) according to a probabil- 
ity distribution p{L): these numbers can be interpreted 
as the lengths of strings of the sequence to build up. 
Then, for any string, we toss a coin and we fill it entirely 
with +I's or — l's, according to whether we get head or 
tail. We assign to the integer numbers L(i) the following 
inverse power law: 



p(L) = 



C 

(T + LY ' 



(36) 



where T and C = 



are two constants 



,£=1 (T+L)f _ 

related the one to the other in such a way as to real- 
ize the normalization condition without the continuous 
time assumption behind Eq.(ll). It is evident that in 



the asymptotic limit of very large times the distribution 
of Eq. ( |36| ) becomes equivalent to that of Eq. (|Tl]) . 

To create the distribution of Eq.(|36|) we proceed as 
follows. We divide the interval of real numbers [0,1] in 
infinite sectors. The L-th sector, Rl, covers the space 



Rl = 



X(L),X(L) 



C 



{T + iy 



where 



X{L) 



if L = 1, 

CELIV^ + n)" if L> 1. 



(37) 



(38) 



The length of the sector Rl is equal to the probability 
p(L) given by the Eq.(^). Then, by using a computer, 
we generate a sequence of rational random numbers T(z) 
uniformly distributed between and 1: if the rational 
number T(i) belongs to the sector Rl, the value L will 
be assigned to the element L(i) of the sequence of inte- 
ger numbers. The described algorithm and the unifor- 
mity of the sequence of rational random numbers T(i) 
assure that the sequence of integer numbers L(i) is dis- 
tributed exactly according to the power law given by the 
equation (|36|). It is worth to point out that this special 
method of creating the artificial sequence to analyze by 
means of the DEA is equivalent to that used by Zumofen 
and Klafter |Q. Of course, due to the time asymptotic 
equivalence with the condition discussed in Section y, 
even in this case the thermodynamic regime is character- 
ized by Levy statistics and the proper scaling is that of 
Eq.@. The diffusion entropy, S d {l) of Eq.(§|), is ex- 
pected to converge asymptotically to the curve of Eq. (||) . 
For example, if we set /i = 2/3 in Eq. (|36]) , on the ba- 
sis of Eq.(p4[) we expect a value 5 = 2/3. In principle, 
using the theoretical approach of Zumofen and Klafter 
p6f we might also evaluate the value of A using 
Since this is not very relevant for the present paper, we 
skip this issue, and we rest on the numerical simulation 
to conclude that A = 1, thereby reaching the conclusion 
that the asymptotic time limit is well reproduced by 



S d (t) 



1 



■ln(i) 



(39) 



From Fig. 2 we see indeed that Eq.(p9|) fits remarkably 
well the time asymptotic limit of the numerical curve. We 
see that this limiting condition is reached after a tran- 
sient that is significantly larger than that of Fig. 1. A 
satisfactory discussion of this transient will be presented 
in Ref. §f . 
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C. DEA and SDA at work 



VII. APPLICATIONS TO DNA SEQUENCES 



In this Subsection we show the benefit of the joint use 
of DEA and SDA. The standard deviation at the diffusion 
time I, D(l), rests on the following prescription 



D(l) = 



M-l-1 



(40) 



where, according to the notation of Section [yj M is the 
sequence length, I denotes the width of moving windows 
necessary to create distinct trajectories and x denotes the 
mean value of x(l). 

According to the theoretical remarks of Section V, 
the adoption of this method applied to an artificial se- 
quence generated by the inverse power law distribution 
of Eq.(3q), with 2 < /i < 3 should yield the true scaling 
of Eq^^J). The SDA should generate the Hurst scaling of 
Eq. (gO ) . We make the analysis of five artificial sequences 
with the power indices: fi = 2.8, 2.6, 2.5, 2.4, 2.2. We 
note that at both /i = 3 and (i = 2 the two predictions 
yield the same values S = H = 0.5 and S = H = 1, 
respectively. Therefore we focus our attention on the 
intermediate values of /i. For these intermediate values 
the correct scaling, namely the Levy scaling, yields 6 = 
0.556, 0.625, 0.667, 0.714, 0.833, respectively, while the 
Hurst scaling is expected to be H — 0.6, 0.7, 0.75, 0.8, 
0.9, respectively. For the sake of reader's convenience 
this situation is summarized in Table I. The numerical 
results illustrated in Fig. 3 provide a strong support to 
the theoretical arguments of Section V, and to our claim 
about the accuracy of the DEA. In fact, we see that the 
DEA yields a remarkable agreement with the Levy scal- 
ing, while the scaling detected by the SDA virtually co- 
incides with the Hurst scaling. 

In general, when the secrete recipe driving the se- 
quence under study is not known, the comparison be- 
tween the DEA and SDA results plays an important role 
to assess the statistical nature of the process. In fact, 
in the case of Levy statistics, it is easy to show, using 
Eq.©, that 5 is related to H by 



S = 



(3 - 2H)' 



(41) 



In the FBM case, according to theoretical arguments of 
Section y, we have 



6 = H, 



(42) 



and this equality can be considered as a plausible indica- 
tion that the Gausssian condition applies. The results of 
Fig. 3 fit Eq.@, thereby confir ming the Levy nature of 
the diffusion process. 



In the last few years, thanks to the recent progress in 
experimental techniques in molecular genetics, a wealth 
of genome data has become available (see for example 
Rcf. G|). This has triggered a large interest both in the 
study of mechanics of folding |40|, and on the statistical 
properties of DNA sequences. In particular, genomes can 
be considered as long messages written in a four-letter al- 
phabet, in which we have to search for information (sig- 
nal). Recently, there have been many papers pointing out 
that DNA sequences are characterized by long-range cor- 
relation, this being more clearly displayed by non-coding 
than by coding sequences P,p|, plp^ ] . 

In this section we will study a large sample of DNA 
sequences (a dozen of both coding and non-coding se- 
quences). In particular we discuss in detail three DNA 
sequances: 

- the human T-cell receptor alpha/delta locus (Gen Bank 
name HUMTCRADCV) |^2[ , a non- coding chromosomal 
fragment (it contains less than 10% coding regions); 

- the Escherichia Coli K12 (Gen Bank name ECO110K) 
Hi, and the Escherichia Coli (Gen Bank ECOTSF) go|, 
two genomic fragments containing mostly coding regions 
(more than 80% for ECO110K). 

The three sequences have comparable lengths, M — 
97634 basis for HUMTCRADCV, M = 111401 basis for 
ECO110K and M = 91430 basis for ECOTSF, respec- 
tively. The first two sequences have been analyzed in 
Ref. [||| by means of the Detrended Fluctuation (DFA) . 
The fundamental difference between them is that the 
non-coding sequence, namely HUMTCRADCV, shows 
the presence of long-range correlation at all scales, while 
the sequence ECO110K, a coding sequence, shows the 
presence of long-range correlation only at the short-time 
scale. The third sequence, ECOTS, has been studied in 
Rcf. |l(| with the interesting conclusion that the large- 
time scale shows non-Gaussian statistics. The authors of 
Ref. p"2] ], using the illuminating example of the lambda 
phage genome, pointed out that the DFA does not mis- 
take the presence of patches of different strand bias for 
correlation. This is an important property, shared by the 
DEA method, which is widely independent of the pres- 
ence of biases, since the entropy increases mainly as a 
consequence of the trajectories departing from one an- 
other. In this Section we show that the DEA method 
makes it possible to relate the non- Gaussian statistics and 
the anomalous scaling of the large-time scale to the same 
cause: the onset of Levy statistics . 



A. The numerical representation of DNA 

The usual way to study the statistical properties of 
DNA is to consider a sequence of four bases: adenine, 



8 



cytosine, guanine, and thymine (respectively A, C, G, 
and T) , at the simplified level of a dichotomous sequence 
of two symbols, purine (for A and G) and pyrimidine (for 
C and T). A trajectory, the so-called DNA walk, can be 
extracted by considering a one-dimensional walker asso- 
ciated to the nucleotide sequence in the following way: 
the walker takes one step up when there is a pyrimidine 
in the nucleotide and a step down when there is a purine. 
The DNA sequence is therefore transformed in a sequence 
£i, i = 1, M, of numbers +1 or —1. 

As pointed out at the end of Section [v| we associate 
the site position along the sequence with time. Thus, i is 
conceived as a discrete time, and the walker makes a step 
ahead or backward, according to whether at time i the 
random walker sees +1 or —1, namely if the i-th site of 
the DNA sequence hosts a pyrimidine or a purine. The 
displacement of the walker after I steps is x(l) — Yli=i d 
and is reported in Fig. 4 for the three sequences under 
consideration. 



B. The three variance methods (VM) at work: 
non-coding and coding DNA sequences 

This section is devoted to illustrating the three differ- 
ent realizations of the variance method (VM), namely, 
the Detrended Fluctuation Analysis (DFA) Q, the 
Standard Deviation Analysis (SDA) , and the Wavelets 
Spectral Analysis (WSA) We have already dis- 

cussed the first two methods in the previous sections. 
We have also showed some results of the application of 



SDA to an artificial sequence in section VI C . As to the 
WSA, it was first adopted to study DNA sequences by 
Arneodo and collaborators in Ref . , and it consists in 
reporting the square root of the wavelet variance. In this 
way, the scaling is comparable to those detected by DFA 
and SDA, and, as we shall see, it gives indeed the same 
results. 

The first property we notice is that all the three series 
present "patches" , i.e. excess of one type of nucleotide. 
In the DFA of ref. Jl2| , Stanley and collaborators adopt 
a detrending procedure to detect the true scaling, since 
the steady bias hidden in the data can produce effects 
which might be mistaken for a striking departure from 
Brownian diffusion, while the interesting form of scalings 
must be of totally statistical nature. They define a de- 
trended walk by subtracting the local trend from the orig- 
inal DNA walk and then they study the variances F(l) of 
the detrended walk. If the walk is totally random, as in 
the ordinary Brownian motion, no correlations exist and 
F(l) ~ l 1/2 . On the contrary, the detection of F(l) ~ l H 
with either H > 1/2 or H < 1/2 is expected to imply 
the presence of extended correlation, which, in turn, is 
interpreted as a signature of the complex nature of the 
observed process. 



To illustrate the results of these authors, let us limit 
to the long-time region the adoption of the symbol H, 
which, according to Section II, is used by us to denote 
the scaling emerging from the VM. When the VM method 
is applied to the short-time region let us call the scaling 
parameter determined by the VM with the symbol H' . 
Stanley et al. |l^] found a scaling exponent H' = 0.61 for 
the non-coding intron sequence HUMTCRADCV, and 
H' = 0.51 for the intronless sequence ECO110K. They 
claim that their detrending method is able to avoid the 
spurious detection of apparent long-range correlations 
which are the artifacts of the patchiness. 

We are now ready to show the three methods at work 
on the DNA data sets we want to study in this paper. 

Non-coding DNA. Fig. 5 refers to the sequence 
HUMTCRADAVC and shows that, within the statisti- 
cal error, the three VM techniques yield the same long- 
time scaling, more precisely the three scaling exponents 
H obtained are 0.59 ± 0.01 (SDA), 0.60 ± 0.01 (DFA), 
0.61 ±0.01 (WSA). 

Coding DNA. In Figs. 6 we study the two se- 
quences ECO110K and ECOSTS, and we show that the 
same equivalence applies to both short-time and long- 
time scaling. In fact for both sequences we find that 
W is 0.53 ± 0.01 (SDA), 0.52 ± 0.01 (DFA), 0.52 ± 0.01 
(WSA), and that H is 0.73 ± 0.01 (SDA), 0.75 ± 0.01 
(DFA), 0.74 ±0.01 (WSA). 

Before moving to illustrate the results obtained by the 
DEA, some comments are in order. DFA detects the scal- 
ing in the long-time region later because of the detrend- 
ing that cuts off long local trend. In Ref. |l2|], Stanley 
and collaborators were interested in studying the scal- 
ing in the short-time region in order to distinguish the 
non-coding from the coding DNA sequences. The DFA 
aims at making more visible this regime. However, we 
think that it is more convenient to study the signal as 
it is, since detrending might erase important information 
as well as the deceiving indication of a correlation that 
does not exists. 

Figs. 5 and 6 show that there is no difference between 
SDA and WSA. This is because the Wavelet Transform 
behaves like the Fourier Transform that studies the vari- 
ance of the signal. Therefore, WSA, as Fourier Spectral 
Analysis, can detect the true scaling only in the Gaussian 
case. In all other cases, WSA detects only the variance 
scaling, and this, as pointed out in Section IV, may not 
coincide with the true scaling. 



C. The Copying Mistake Map: a model for DNA 
sequences 

According to the dynamical model of Ref. || a non- 
coding DNA sequence corresponds to an artificial se- 
quence with inverse power law long-range correlation as 
the Levy walk of Section IV , examined by means of the 
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DEA in Section VI. On the other side, a coding sequence 
can be obtained by adopting a kind of generalization of 
the Levy walk. This generalization becomes a model 
called Copying Mistake Map (CMM) g. This model 
rests on two sequences of +'s and — 's, running indepen- 
dently the one from the other. The former sequence is 
the correlated sequence studied in Section VI C by means 
of the joint use of DEA and SDA. The latter sequence is 
obtained by tossing a coin. According to the CMM, the 
generic i — th site of the DNA sequence is assigned the 
symbol pertaining to the i — th site of either the former or 
the latter sequence. The former sequence is selected with 
probability pl and the latter sequence with probability 
Pr = 1 — pl- In the case of coding sequences usually the 
condition 



PR. > PL 



(43) 



applies. The authors of Ref. jlO|] pointed out that the 
CMM model is equivalent to an earlier model [p|j4l| 
called Generalized Levy Walk (GLW). The CMM (and 
the GLW, as well, of course) yields, for short times, a dif- 
fusion process indistinguishable from ordinary Brownian 
motion. At large times, however, the long-range correla- 
tion predominates. In Ref. |l(| the CMM was adopted to 
account for the properties of prokaryotes, for which a sig- 
nificant departure from Gaussian statistics occurs. One 
of the coding sequences studied here, namely ECOTSF, 
is the same as one discussed in Ref. Q. It produces 
strong deviations from Gaussian statistics. On the ba- 



sis of that, and of the results of Section VI B, we expect 



also for coding sequences at large times a scaling param- 
eter 5 corresponding to the Levy statistics, and so, to the 
prediction of both Eq.@ and Eq.@. 

The CMM is a model flexible enough as to move from 
the Gaussian to the Levy condition. This is done sim- 
ply setting pr = 0. On the other hand, if the condition 
of Eq.(^||) applies, in the long-time limit we expect the 
condition of Levy statistics will emerge again. This is so 
because the most evident sign of Levy statistics is given 
by the power law character of the distribution tails. The 
correlated component of the CMM model results in a 
process of diffusion faster than ordinary diffusion, and so 
faster than the diffusion generated by the random compo- 
nent. As a consequence, the distribution tails are forced 
to get the character of an inverse power law. 



D. DEA at work: non-coding and coding DNA 
sequences 

By using the DEA algorithm we can detect the exis- 
tence of scaling, either normal or anomalous, Gaussian 
or Levy, in a very efficient way, and without altering the 
data with any form of dctrcnding. We analyze the data 
of both the coding and non-coding sequences. Starting 



from the sequence , i = 1, N we create the diffusion 
trajectories and we compute the diffusion entropy Sd(l) 
according to equation (|33|). The results are reported in 
Figs. 7-9. We determine the scaling as the slope of the 
tangent of the curve Sd(r). As for the second moment 
scaling, called H or H' , according to whether it refers 
to long or short times, we adopt for the DEA scaling the 
corresponding symbols S and 6'. It is evident that S is the 
true scaling. As to the meaning of 6', it will be discussed 
at the end of this section. 

Non-coding DNA. First of all let us consider the 
non-coding sequence HUMTCRADCV. Fig. 7a shows 
that the DEA results in what seems to be a time de- 
pendent scaling. This is pointed out by means of the two 
straight lines of different slopes, 8' = 0.615 ± 0.01 and 
6 = 0.565 ± 0.01 Anomalous diffusion shows up at both 
the short-time and the long-time scale, and this seems to 
be a common characteristic of non-coding sequences, sup- 
ported also by the application of our technique to other 
non-coding DNA sequences. Moreover, we notice that 
the scaling in the short-time regime 6' — 0.615 ±0.01 co- 
incides with the value found by means of the DEA analy- 
sis H' = 0.61 ±0.01. The authors of Ref. Q assign 
this scaling value to both the short and the long-time 
regime, while the DEA detects a different scaling at long 
times. Fig. 7b shows the result of the DEA applied to an 
artificial sequence built up according to the CMM pre- 
scription so as to mimic the sequence HUMTCRADCV. 
We use both [i and pr as fitting parameters. In this case, 
the intensity of the random component is not predomi- 
nant, as in the case of the coding sequences, which are 
known jl(| to require the condition of Eq. (^3[) . In fact, 
in this case the best fit between the real and the CMM 
sequence is obtained by setting pr = 0.56 ± 0.02. As to 
/j,, the value of it emerging from this fitting procedure, is 
considered by us to be the best estimate of this inverse 
power law index. This value is fi = 2.77±0.02. If we plug 
it into Eq.@, we get H = 0.615 ± 0.01, which is in fact 
the scaling detected in Ref. |l2]. This means that the 
short-time region obeys the FBM statistics. If we plug it 
into Eq.(|24|) we obtain 6 — 0.565±0.01, which is the slope 
of the DEA curve in the long-time regime, thereby prov- 
ing that the relation between <5, the true scaling, and H 
obeys the condition of Eq.(|4l|), which is, as erlier pointed 
out, a clear indication of Levy statistics. We consider this 
to be a compelling evidence that at this long-time scale 
Levy rather that FBM diffusion is generated. 

Coding DNA. In Figs. 8 and 9 we turn to the more 
delicate problem of coding sequences. The first sequence 
(ECO110K) has already been studied by means the DFA 
analysis in Ref. [0. The DFA finds H 1 = 0.52 ± 0.01 at 
the short-time scale and H = 0.75±0.01 in the large-time 
scale. The second sequence (ECOTSF) has been ana- 
lyzed in Ref. || by using four different methods. The first 
was the SDA discussed in Section C. This is a method of 
analysis less sophisticated than the DFA, since does not 
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imply any local detrending. The second and third meth- 
ods were the DFA and the Hurst analysis pl| , respec- 
tively. The fourth method used was the Onsager regres- 
sion analysis, a method that, in that context, provides 
information on the correlation function of the fluctua- 
tion £, which has an inverse power dependence on time 
/ with the power index j3 = fi — 2. The authors of Ref. 
H , by using essentially the first method and the Onsager 
regression analysis, reached the conclusion that the most 
plausible value of the scaling parameter in the long-time 
region is H — 0.75 ± 0.01 that is equivalent to the expo- 
nent H = 0.74 ±0.01 found in Figs. 6. It is interesting 
to remark that the coincidence among the different pre- 
dictions about scaling, and especially that between the 
second moment technique and the Hurst analysis, im- 
plies the adoption of the Gaussian assumption jl2| . On 
the other hand, when that condition does not apply and 
the two scaling predictions are different, to the best of 
our knowledge, it does not seem to be known what is the 
meaning of any of them. Furthermore, the authors of 
Ref. Jl0| pointed out that the statistics of the long-time 
regime is too poor to support any claim on the departure 
from the Gaussian condition. In conclusion, in Ref. ]Tc| ] 
the claim that the DNA statistics is of Levy kind was 
essentially based on the assumption that the dynamical 
theory of Refs. §,[flj is a reliable approach to the stati- 
stics of DNA sequences. No direct evidence was provided. 

The DEA method allows us to prove that the conjec- 
ture of the authors of Ref. [jl0| is correct: the results illus- 
trated in Figs. 8 and 9 afford a convincing proof that the 
DNA statistics is of Levy kind. Figs. 8a and 9a clearly 
show the difference between the slope at short time (re- 
spectively 8' = 0.52 ±0.01 in Fig. 8a and 8' = 0.53 ±0.01 
in Fig. 9a) which, in this case, is very close to that of 
ordinary random walk, and the slope at long time that 
corresponds to 8 = 0.665 ± 0.01. Since we know that 
in both cases the long-time slope provided by the DFA 
is H = 0.75 ± 0.01, we conclude that in both cases the 
condition of Eq. (^l|), indicating Levy statistics, applies. 
Figs. 8b and 9b aim at fitting the curves produced by the 
DEA method, applied to the real sequences by means of 
the CMM model. The purpose is not only that of proving 
that the CMM can become so close to the real results as 
to be virtually indistinguishable from them. It is also a 
way, already applied in Fig. 7b, to derive very accurate 
values for the power index /i. A very good agreement 
is obtained by setting p R = 0.943 ± 0.01 for ECO110K 
(Fig.5b) and p R = 0.937 ± 0.01 for ECOTSF (Fig.7b). 
The very good fitting accuracy supports the physical 
reasons that led the authors of Ref. J8| to propose the 
CMM model for coding sequences. In fact, with the large 
weight, pu — 0.937 ± 0.01, assigned to the random com- 
ponent, the scaling values become 8' = 0.52 ± 0.01 and 
8' = 0.53 ± 0.01, namely, very close to the conventional 
scaling 8 — H = 0.5. This normal condition lasts for an 
extended period of time, and eventually, at larger times 



the transition to a larger scaling takes place. 

We note that the authors of Ref. |ll| find anoma- 
lous diffusion in a statistical condition that they claim 
to be Gaussian. According to the result of Ref. JlT| , 
the Gaussian condition is incompatible with a station- 
ary diffusion process generated by a dichotomous fluctu- 
ation yielding a non integrable correlation function with 
an inverse power law character. This dichotomous fluc- 
tuation is expected to generate Levy rather than Gauss 
statistics. The authors of Ref. || studied under which 
physical condition FBM is allowed to show up, in ap- 
parent conflict with the conclusions of Ref. pT| . With 
the help of a fractal model for the DNA folding, the au- 



thors of Ref. 
per of Ref. 



proved that FBM, advocated by the pa- 
, is possible as a form of non-stationary 
process. Thus, in principle, the arguments of the work 
of Ref. Jll} would not rule out the possibility that the 
changing slope is a manifestation of a FBM with a time 
dependent scaling. This would be another form of tran- 
sition from dynamics to thermodynamics, of extremely 
large time duration. However, this way of establishing 
a compromise between the compelling prediction of the 
GCLT, according to which a dichotomous process with 
long-range correlation (2 < /i < 2 must produce Levy 
statistics, and the conclusion of some authors that this 
statistics is Gaussian |l3| is ruled out by the statistical 
analysis of the present paper, which is made much more 
accurate than the earlier approaches by the DEA. This 
is made compelling not only by the results illustrated in 
Figs. 8 and 9, but also by a plenty of statistical measure- 
ments on different DNA sequences, reported in Table II. 
All these results prove that the equality of Eq. (|4l]) , im- 
plying Levy statistics, applies to both kind of sequences. 
This means that in both cases the long-time limit is char- 
acterized by Levy statistics and that this is the form of 
non-Gaussian statistics revealed by the analysis of Ref. 

B 

We can now address the delicate problem of the tran- 
sition from 8' to 8. On the basis of the results of Figs. 8 
and 9, we would be tempted to conclude hat the CMM is 
a reliable dynamical model for DNA sequences. If this is 
correct, the transition from 8' to 8 is really a time depen- 
dent scaling. In fact, according to the CMM the short- 
time scale is dominated by the random component, due 
to the fact that the condition of Eq.(|43|) applies. In the 
case of Fig. 7 the transition from 8' to 8 is probably dom- 
inated by a completely different effect. This is the slow 
transition from dynamics to thermodynamics discussed 
in Section 



IV 



E. Significance of the results obtained 

To properly appreciate the significance of the results 
of this section, it is necessary to say a few words about 
the two different scaling prescriptions of Eqs.d24) and 
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(pp|). The scaling prescription of Eq.(pO|) is determined 
by the adoption of the variance method, as clearly illus- 
trated by the dynamical approach to the DNA sequences 
of Ref . H . This prescription is not ambiguous if the con- 
dition of Gaussian statistics applies. In fact, a Gaussian 
distribution drops quickly to zero, and the existence of a 
finite propagation front does not produce any significant 
effect. It has to be pointed out, in fact, that the adop- 
tion of the Brownian landscape proposed in the pioneer 
papers of Refs. ]2|,|],[l2| implies the existence of a prop- 
agation front moving with ballistic scaling (5 = 1). In 
other words, if we find a window of length I filled with 
only 1 's or with only — l's, this means a trajectory trav- 
elling with uniform velocity, and the x-space at distances 
from the origin larger that / is empty. The existence of 
a propagation front does not have big consequences in 
the case of Gaussian statistics, since the population at 
the propagation front is essentially zero in that case. It 
is not so in the case of Levy statistics, though, due to 
the existence of very long tails in that case. Therefore 
the Levy processes resulting from these sequences are es- 
sentially characterized by the presence of two distinct 
scaling prescriptions, the Levy prescription of Eq. (24), 



quences of Section VI B, that the DEA is a method of 



concerning the portion of distribution enclosed between 
the two propagation fronts, and the scaling S = 1, of the 
propagation front itself. The scaling of the variance of 
Eq. ( p0| ) does not reflect correctly either of these two dif- 
ferent scaling prescriptions, being a kind of compromise 
between the two. The scaling of the distribution enclosed 
by the two propagation fronts is, on the contrary, a gen- 
uine property that corresponds to the prediction of the 
GCLT Q . It is very satisfactory indeed that the DEA 
method makes this genuine form of scaling emerge. Fur- 
thermore, the DEA is a very accurate method of scal- 
ing detection, as proved by the fact that it reveals the 
existence of Levy statistics in the case of the coding se- 
quence. In this pointed out by the authors of 
Rcf . [|| , the ordinary methods become inaccurate due to 
the poor statistics available in the long-time limit. 

Another important result of this section is that it con- 
firms the validity of the CMM model. This model is 
expected to generate Levy statistics not only in the case 
of non-coding sequences, where it is easier to reveal this 
property. It predicts Levy statistics also in the case of 
coding sequences as the one here analyzed. In Ref. ||] 
the emergence of Levy statistics was conjectured but not 
proved, due to the fact that in that paper the observation 
was made monitoring the probability distribution p(x, t). 
As already pointed out, the lack of sufficient statistics 
makes it difficult to assess if the distribution p(x,t) has, 
or not, tails with an inverse power law character. In 
Rcf. |h]] a clear deviation from the Gaussian condition 
was detected in the long-time limit, but, again, no direct 
evidence was found that this deviation from Gaussian 
statistics takes the form of Levy statistics. The results 
of this section prove, with the help of the artificial se- 



analysis so accurate as to assess with good accuracy the 
property of Eq.(|4l|), and with it, the emergence of Levy 
statistics for both coding and non-coding sequences. 

In conclusion, this paper lends support, with the help 
of the DEA, an efficient technique of scaling detection, 
to the claims of Allegrini et al |ll| about the controversy 
between Voss B and the authors of Ref. || . The differ- 
ences in the findings of the groups-long-range correlations 
being ubiquitous in DNA sequences by Voss Q and such 
correlations being absent in Ref. motivated the au- 
thors of Ref. [|ll| to develop a phenomenological model, 
the CMM, that might have mitigated the differences be- 
tween the two apparently conflicting perspectives. The 
validity of the point of view of these authors is fully con- 
firmed, since Levy statistics, and consequently long-range 
correlations, seem to be ubiquitous, being a property of 
the long-time regime of both coding and non-coding se- 
quences, while the properties of ordinary Brownian mo- 
tion are confined to the short-time regime of coding se- 
quences. 



VIII. CONCLUSIONS 

This paper shows that long-range correlations result 
in a very slow transition to scaling, regarded as a form 
of thermodynamic equilibrium. The standard methods 
of statistical analysis (variance methods) are a source of 
misleading information in this case: the first being mis- 
taking the regime of transition to scaling as either ordi- 
nary or anomalous scaling. The second is that the scaling 
value, as determined by the evaluation of the second mo- 
ment, might significantly depart from the correct one. 
All the VM techniques are shown to be affected by this 
limitation, while the DEA is the only technique always 
yielding the correct scaling value, if the scaling condition 
applies. The application to the study of DNA sequences 
reported in this paper yields: 

1) a striking example of how the standard techniques 
can produce misleading conclusions 

2) a suggestive example of the power of the DEA, 
which, in this case, is able to indicate clearly that both 
coding and non-coding sequences generate Levy statistics 
in the long-time limit. 

The DEA is not only a method of scaling detection. Its 
entropic nature gives also useful insights into the regime 
of transition from dynamics to thermodynamics. It is 
possible to prove that in the special case where the time 
series is generated by fluctuations around a locally vary- 
ing bias the regime of transition to the scaling regime is 
significantly delayed. The ordinary techniques of analysis 
mistaken this transition regime as a form of anomalous 
memory, while the DEA makes it possible to establish 
the genuine nature of the process under study. This is 
left as a subject for further applications. 
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FIG. 1. Diffusion entropy of a random walker as a function of the number of jumps 1. The dashed lines and the solid line 
denote the discrete diffusion entropy Sd(l), of Eq.(^) and the continuous prescription of Eq.(|^), respectively. After a short 
transient the dashed line converges to the solid line. 
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FIG. 2. Diffusion entropy of the Levy process generated by an articial sequence, fj, corresponding to the power coefficient 
(i — 2.5 and T = 0. The dashed line is the diffusion entropy, Sd(l), in the discrete-space perspective given by the Eq.(^). The 
solid line is the diffusion entropy S(l) in the continuous-space perspective given by the Eq.(Q9|). After an initial transient, the 
dashed line converges to the solid line. 
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FIG. 3. Diffusion entropy and Standard Deviation of the Levy process generated by articial sequences £i corresponding 
respectively to five different values of the power coefficient /i, namely: /i = 2.8, 2.6, 2.5, 2.4, 2.2, and T = 0. The numerical 
results of the Diffusion Entropy Analysis (DEA) (3a) and of the Standard Deviation Analysis (SDA) (3b), reported in symbols, 
are in perfect agreement with the theoretical predictions, reported as fitting lines and obtained by using respectively the values 
of the pdf scaling exponent S and of the esponents H in Table I. 
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FIG. 4. In (a) we report the DNA walk relative to HUMTCRADVC, a non-coding chromosomal fragment. In (b) and 
we report the DNA walk relative to ECO110K and ECOTSF, two coding genomic fragments. 
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FIG. 5. Application of the three variance methods SDA, DFA and WSA to the sequence HUMTCRADVC, the non-coding 
chromosomal fragment. The three methods give the same exponent H . In fact we get H — 0.59 ± 0.01 (SDA), H — 0.60 ± 0.01 
(DFA) and H = 0.61 ±0.01 (WSA), where the differences are within the error bars. Moreover H is the same both at short-time 
and long-time regions (i.e. H' — H). 
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FIG. 6. Application of the three variance methods SDA, DFA and WSA to ECO110K (a) and ECOTSF (b), the two coding 
genomic fragments. The scaling exponent H' in the short-time region is 0.53 ± 0.01 (SDSA), 0.52 ± 0.01 (DFA), 0.52 ± 0.01 
(WSA). The scaling exponent H in the long-time regions is 0.73 ± 0.01 (SDSA), 0.75 ± 0.01 (DFA), 0.74 ± 0.01 (WSA). 
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FIG. 7. Diffusion Entropy for the HUMTCRADCV (the non-coding chromosomal fragment) and its CMM simulation. Fig. 7a 
shows that the DE analysis results in a scaling changing with time. The slope of the two straight lines is 5' = 0.615 ± 0.01 in 
the short-time region, and S — 0.565 ± 0.01 in the long-time regime. Fig. 7b shows the comparison between the DEA of the real 
non-coding sequence and an artificial sequence corresponding to the CMM model: pa = 0.56 ±0.02, T = 0.43, /i = 2.77 ±0.02. 
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FIG. 8. Diffusion Entropy for the ECO110K (one of the two coding genomic fragments studied) and its CMM simulation. 
Fig. 8a shows that the DEA results in a scaling changing with time. The slope of the two straight lines is S' — 0.52 ± 0.01 at 
short-time regime, and 8 — 0.665 ± 0.01 at long-time regime. Fig. 8b shows the comparison between the DE analysis of the real 
coding sequence and an artificial sequence corresponding to the CMM model: pu = 0.943 ± 0.01, T = 45, // = 2.5 ± 0.02. 
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FIG. 9. Diffusion Entropy for the ECOTSF (the second of the two coding genomic fragments studied in this paper) and 
its CMM simulation. Fig.9a shows that the DEA results in a scaling changing with time. The slope of the two straight lines is 
5' = 0.53 ± 0.01 in the short-time region, and S = 0.665 ± 0.01 in the long-time regime. Fig.9b shows the comparison between 
the DEA of the real coding sequence and an artificial sequence corresponding to the CMM model: pr = 0.937 ± 0.01, T = 60, 
^ = 2.5 ±0.02. 
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0.900 


0.800 


0.750 


0.700 


0.600 
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0.833 


0.714 


0.667 


0.625 


0.556 



TABLE I. In the first line from the top we report the power indices of the inverse power law distributions used to create the 
artificial sequences studied in Fig. 3. In the second line from the top we report the corresponding Hurst coefficients, prediction 
of Eq.(EOJ). In the third line from the top we report the true scaling, namely the Levy scaling of Eq.d24|). 
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ECO110K 


111401 


0.74 


0.66 


0.66 


ECOTSF 


91430 


0.74 


0.66 


0.66 


LAMCG 


48502 


0.85 


0.77 


0.76 


CHKMYHN 


7003 


0.74 


0.66 


0.66 


DDIMYHC 


6680 


0.68 


0.61 


0.61 


DROMYONMA 


6338 
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0.64 


HUMBMYH7CD 


6008 


0.63 


0.57 


0.58 
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13957 


0.69 
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TABLE II. Values of the scaling exponents H and S for a set of different coding and non-coding sequences. In the first 
column we report the GenBank name of the sequence Q, and in the second column the length N of the sequence. For all 
measures the error is ±0.01. Sh in the fourth column is the theoretical value for 8 if the Levy ondition applies, Eq.(|4l|). If 
the length of the genome is larger than 20,000 the fitted region is 100 < I < 2000. If the length of the genome is shorter than 
20,000, the statistics are not very good for large I. In this case, the fitted region is 20 < / < 200. 
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